Intro to NGS processing

James A. Fellows Yates

2021-08-17

Who am I?

  • Education
    • B.Sc. Bioarchaeology (University of York, UK)
    • M.Sc. Naturwissenschaftliches Archäologie (University of Tübingen, DE)
    • Ph.D. Archaeogenetics (MPI-SHH / MPI-EVA, DE)
  • Experience
    • Number of genetics classes taken: 0
    • Number of bioinformatics classes taken: 0

@jfy133

Today we will

  1. Describe the basics of DNA
  2. Introduce what DNA sequencing is
  3. Explain how Illumina NGS sequencing data is generated
  4. Evaluate NGS data [Practical]

Introduction to DNA

What is DNA?

Deoxyribonucleic acid (/diːˈɒksɪˌraɪboʊnjuːˌkliːɪk, -ˌkleɪ-/; DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. - Wikipedia

What is DNA?

DNA structure

The rules

  • Four nucleotides
    • Pyrimidines: Cytosine, Thymine
    • Purines: Guanine, Adenine
  • Base pairing: one pyrimidine with one purine
    • C with G (think: CGI)
    • A with T (think: AT-AT walker)
  • Complementary
    • C on one strand, G on the other (or v.v.)
    • A on one strand, T on the other (or v.v.)
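
The pairing rules above fit in a small lookup table. What follows is a minimal Python sketch rather than code from any library; the names `PAIRS`, `complement` and `is_complementary` are made up for illustration.

```python
# Base-pairing rules from the slide: C pairs with G, A pairs with T.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that sits opposite each position of `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())


def is_complementary(strand_1: str, strand_2: str) -> bool:
    """Check position by position that two equal-length strands pair up."""
    return len(strand_1) == len(strand_2) and complement(strand_1) == strand_2.upper()


print(complement("ATCG"))                # TAGC
print(is_complementary("ATCG", "TAGC"))  # True
print(is_complementary("ATCG", "TAGG"))  # False
```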

AT-AT Walker

The rules

  • Make a copy of a DNA strand with a polymerase
    • Unwind the DNA
    • Separate the strands
    • Make new strand: find a C, get new G (etc)

DNA replication split

How do we get DNA?

Figure 17 01 02

Introduction to DNA Sequencing

What is Sequencing?

Converting the chemical nucleotides of a DNA molecule

to

ACTG on your computer screen

Historically

  • Sanger sequencing

Sanger-sequencing

  • Separate strands, add primer (starting point)
  • Add mix of nucleotides, some with special ‘terminators’
  • Pass through size-filtering, read order of terminators
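
As a rough illustration of the three steps above, here is a toy Python simulation. It assumes an idealised reaction in which a terminator stops synthesis exactly once at every possible position and nothing ever goes wrong; the function name `sanger_read` is hypothetical.

```python
import random

PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_read(template: str, seed: int = 0) -> str:
    """Toy model: recover the newly made strand from terminated fragments."""
    # The polymerase builds the strand complementary to the template.
    new_strand = "".join(PAIRS[base] for base in template)
    # Idealised assumption: across many molecules, synthesis stops at
    # every possible length, always on a labelled terminator.
    fragments = [new_strand[:length] for length in range(1, len(new_strand) + 1)]
    # In the tube the fragments are all mixed together.
    random.Random(seed).shuffle(fragments)
    # Size filtering orders them from shortest to longest...
    fragments.sort(key=len)
    # ...and reading the terminator (last base) of each gives the sequence.
    return "".join(fragment[-1] for fragment in fragments)


print(sanger_read("ATCG"))  # TAGC
```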

Pros and cons of Sanger Sequencing

  • Pros
    • Very precise (few errors, still the ‘gold standard’)
    • Sequence long DNA molecules
  • Cons
    • Resource heavy, requiring a lot of input DNA
    • Slow: one. fragment. at. a. time.

What is NGS?

  • “Next Generation Sequencing”
    • Sequence millions and even billions of DNA reads at once!
    • via MASSIVE multiplexing!
    • Sequence lots of samples at once!
    • Fast and cheap!

Not really ‘next’ anymore, consider it more ‘second’ generation (see: Nanopore)

What is NGS?

Market leader:

Illumina HiSeq 2500

(Others: Roche 454, PacBio, IonTorrent etc.)

How does it work?

  • Basically same concept, but:
    • no size separation
    • with pretty pictures!

i.e. to a strand, attach a complementary fluorophore-modified nucleotide, (normally) one colour per base

A, G, T, C

Fire mah lazer, and take a picture! Rinse and repeat!

How does it work?

via Gfycat

Where does this happen?

On a ‘flow cell’

Next generation sequencing slide

Where does this happen?

But how do you get your DNA to attach to the lawn

(and not get lost)?

  • Convert it to library:
    • Add adapters: bind to the ‘lawn’ of the flow cell
    • Add indexes: sample-specific barcode
    • Add priming sites: where enzymes start copying DNA

AATGATACGGCGACCACCACaccgacaaCCCTACACGACGCTCTTCCGATCTXXXXXXAGCACACGTCTGAACTCCAGTCACgacactaCCGTCTTCTGCTTG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTACTATGCCGCTGGTGGTGtggctgttGGGATGTGCTGCGAGAAGGCTAGAXXXXXXTCGTGTGCAGACTTGAGGTCAGTGctgtgatGGCAGAAGACGAAC

[Adapter & Index Primer] [Index] [Target primer] [Target] [Target primer] [Index] [Adapter & Index Primer]

Sequencing-by-synthesis

Once bound, the fluorescence of one molecule is not enough…

Cluster Generation

Make lots of copies, a.k.a. clustering! One cluster == many copies of one DNA molecule

Sequencing-by-synthesis

Cluster Generation

  1. Add fluorescent nucleotides (complementary will bind)
  2. Fire laser & take photo
  3. Wash away unbound nucleotides
  4. Remove fluorophore
  5. Back to 1 ⤴️
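
The loop above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the colour assignment in `COLOURS` is invented for illustration (real instruments differ), every cycle is error-free, and the function names are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hypothetical one-colour-per-base scheme, for illustration only.
COLOURS = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOUR_TO_BASE = {colour: base for base, colour in COLOURS.items()}


def sequence_by_synthesis(template: str, cycles: int) -> list[str]:
    """Return the colour 'photographed' in each imaging cycle."""
    photos = []
    for position in range(min(cycles, len(template))):
        # 1. Only the complementary fluorescent nucleotide binds.
        incorporated = PAIRS[template[position]]
        # 2. Fire laser & take photo: record the colour we see.
        photos.append(COLOURS[incorporated])
        # 3-4. Wash and remove the fluorophore: nothing carries over.
        # 5. Back to 1 (next loop iteration).
    return photos


def call_bases(photos: list[str]) -> str:
    """Convert the series of colours back into letters."""
    return "".join(COLOUR_TO_BASE[colour] for colour in photos)


photos = sequence_by_synthesis("GATTACA", cycles=5)
print(photos)              # ['blue', 'red', 'green', 'green', 'red']
print(call_bases(photos))  # CTAAT
```

Note that with 5 cycles only the first 5 bases of the 7-base template are read: the number of cycles caps the read length.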

What does this look like?

Cluster Generation

Improving quality

  • Over time, imaging reagents get ‘tired’ and more errors occur

    • Polymerases aren’t perfect
    • Bases sometimes don’t bind, or multiple bind at once == clusters get ‘desynced’
    • With each ‘photo’, machine calculates probability it got the ‘right’ nucleotide
    • Each photo of each cluster gets a ‘base-quality’ score
  • What if the molecule is longer than the number of imaging cycles?

  • Improvement: paired-end sequencing

    • Get order of nucleotides by sequencing from one end
    • Get reverse order of nucleotides, by sequencing from the other end
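
A toy Python sketch of the idea, assuming error-free reads. One detail the bullet glosses over: because the second read is taken from the opposite strand, it comes out as the reverse *complement* of the fragment's far end, not just the reverse. The function names are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand, read from its own starting end."""
    return "".join(PAIRS[base] for base in reversed(sequence))


def paired_end_reads(fragment: str, cycles: int) -> tuple[str, str]:
    """Read `cycles` bases inwards from each end of the fragment."""
    read_1 = fragment[:cycles]
    read_2 = reverse_complement(fragment)[:cycles]
    return read_1, read_2


fragment = "AACCGGTTAC"  # 10 bases, but we only image 4 cycles per read
read_1, read_2 = paired_end_reads(fragment, cycles=4)
print(read_1)  # AACC
print(read_2)  # GTAA
# Flipping read 2 back shows it covers the far end of the fragment:
print(reverse_complement(read_2))  # TTAC
```

Here the two middle bases (`GG`) are still never read, but both ends of the molecule are now known.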

Paired end sequencing

MiSeq™, HiSeq™ 1000/1500/2000/2500 and NovaSeq™ 6000 v1.0 reagents paired-end flow cell, © 2021 Illumina, Inc. All rights reserved. Used here for training purposes only


Photos to DNA string

  • Special software (e.g. bcl2fastq):

  • For each location on the flow cell (cluster):

    • Record the sequence of bases (from colours)
    • Calculate the probability that the ‘base call’ is correct (e.g. was the image blurry or weak?)
    • Note the index in the sequence (sample-specific barcode)
  • Group each recorded sequence (or ‘read’) with those that have the same index

    • a.k.a. demultiplexing
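
The grouping step can be sketched in Python as below. This is not how bcl2fastq is implemented; it is a minimal illustration that assumes exact index matches only, and the sample sheet, index sequences and reads are all made up.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Group (index, sequence) pairs by the sample their index belongs to."""
    by_sample = defaultdict(list)
    for index, sequence in reads:
        # Indexes not in the sample sheet go into a catch-all bin.
        sample = sample_sheet.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical sample sheet: which barcode was added to which sample.
SAMPLE_SHEET = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}

# Hypothetical reads as (index, sequence) pairs.
READS = [
    ("ACGTAC", "GGGTGATG"),
    ("TTGCAA", "GTTCAGGG"),
    ("ACGTAC", "CCGATGGC"),
    ("GGGGGG", "ATATATAT"),  # unknown index
]

for sample, sequences in demultiplex(READS, SAMPLE_SHEET).items():
    print(sample, sequences)
```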

FASTQ File

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity. - Wikipedia

FASTQ File

Example

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72                      # Read ID
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA     # DNA sequence
+                                                                            # Separator
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/     # Quality line
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBIIIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
@read id
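
To make the four-line layout concrete, here is a minimal Python reader. It assumes the simple case shown above (exactly four lines per record, no line-wrapped sequences) and silently ignores an incomplete trailing record; the names `FastqRecord` and `parse_fastq` and the example read are hypothetical. For real data an established parser is the safer choice.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per block of four non-empty lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for start in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record starting at line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        # Drop the leading '@' from the read ID.
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = ["@read1 example\n", "GATTACA\n", "+\n", "IIII9I/\n"]

for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.quality)
```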

Quality score

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.2......................26...31........41          
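
The table above can be decoded with a couple of lines of Python. This sketch assumes the "Phred+33" encoding implied by the table (`!` is ASCII 33 and means quality 0) and the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10). The function names are illustrative.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Convert each quality character into its numeric Phred score."""
    return [ord(character) - offset for character in quality_line]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred_scores("!5I"))    # [0, 20, 40]
print(error_probability(20))  # roughly 0.01, i.e. 1 in 100 calls wrong
print(error_probability(40))  # roughly 0.0001, i.e. 1 in 10,000 calls wrong
```

So the long runs of `I` in the example reads are high-confidence calls, while characters such as `/` or `)` mark bases where the machine was much less sure.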

Cons of NGS sequencing

  • Less accurate (laser/photo can get it wrong)
  • Chemistry limits (DNA strands get old through heat cycling for denaturing; dirty environment from suboptimal wash steps etc.) mean short reads (compensated by volume)

Things to remember

  • Indexes
  • Adapters
  • Cycle-quality decay
  • paired-ends!